Omkar Ninav
June 30, 2025
Audio Input
Voice signal captured from a microphone or file
Feature Extraction
Converts raw audio into numerical features (e.g., MFCCs, spectrograms)
Acoustic Model
Maps features to phonemes or characters using ML/DL models
Language Model
Predicts word sequences for context-aware transcription
Decoder
Combines acoustic-model and language-model scores to produce the final text
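The feature-extraction stage above can be illustrated with a minimal, standard-library-only sketch. It computes a magnitude spectrogram (framing, Hann window, naive DFT) for a synthetic tone. The frame length, hop size, sample rate, and test tone are assumptions chosen for illustration; a full MFCC front end would additionally apply a mel filterbank, a logarithm, and a DCT, which are omitted here.

```python
import cmath
import math


def frame_signal(samples, frame_len, hop):
    """Split a list of samples into overlapping frames of equal length."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def magnitude_spectrum(frame):
    """Apply a Hann window, then return DFT magnitudes for bins 0..N/2."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed)))
            for k in range(n // 2 + 1)]


def spectrogram(samples, frame_len=64, hop=32):
    """One magnitude spectrum per frame (time x frequency)."""
    return [magnitude_spectrum(f) for f in frame_signal(samples, frame_len, hop)]


# Synthetic input (assumption): a 1 kHz sine sampled at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(256)]

spec = spectrogram(tone)
# Strongest frequency bin of the first frame, converted to Hz.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / 64
```

The naive DFT is quadratic in the frame length, so this is only suitable for small demonstrations; real pipelines use an FFT.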
flowchart LR
    A[Audio Input<br/>Voice signal] --> B[Feature Extraction<br/>MFCCs / Spectrograms]
    B --> C[Acoustic Model<br/>ML/DL-based]
    C --> D[Language Model<br/>Contextual Prediction]
    D --> E[Decoder<br/>Final Text Output]
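The last two stages of the pipeline can be sketched roughly as follows: a tiny add-one-smoothed bigram language model rescoring candidate transcripts, with the decoder picking the hypothesis whose combined score is highest. The toy corpus, the acoustic log-probabilities, and the `lm_weight` parameter are all made up for illustration and do not come from any real system.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count unigrams and bigrams, with <s> marking the sentence start."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def lm_logprob(sentence, unigrams, bigrams):
    """Add-one smoothed bigram log-probability of a sentence."""
    vocab = len(unigrams)
    words = ["<s>"] + sentence.lower().split()
    return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
               for a, b in zip(words, words[1:]))


def decode(hypotheses, unigrams, bigrams, lm_weight=1.0):
    """hypotheses: list of (text, acoustic_logprob); return the best text."""
    def combined(h):
        text, acoustic = h
        return acoustic + lm_weight * lm_logprob(text, unigrams, bigrams)
    return max(hypotheses, key=combined)[0]


# Toy training text and hypotheses (assumptions for illustration only).
corpus = [
    "we will learn taylor series",
    "we will plot the polynomial",
    "the taylor series converges",
]
unigrams, bigrams = train_bigram(corpus)
hyps = [
    ("we will learn tailor serious", -4.0),  # slightly better acoustic score
    ("we will learn taylor series", -4.2),
]
best = decode(hyps, unigrams, bigrams)
acoustic_only = decode(hyps, unigrams, bigrams, lm_weight=0.0)
```

With the language model switched off (`lm_weight=0.0`) the acoustically stronger but implausible hypothesis wins; with it on, the contextually likely one does.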
HMMs + GMMs
Early models for mapping acoustic features to phonemes
n-gram Language Models
Predict word sequences based on prior word probabilities
Feature Extraction (MFCCs)
Converts raw audio into spectral features
RNNs, CNNs, Transformers
Deep models for sequence modeling and attention
End-to-End Models (CTC / Attention)
Learn transcription directly from audio input
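As a small illustration of the CTC idea listed above, greedy CTC decoding takes the most likely symbol per frame, collapses consecutive repeats, and then drops the blank symbol. The alphabet, the blank marker `_`, and the per-frame probabilities below are invented for the example; production systems usually use beam search rather than this greedy variant.

```python
from itertools import groupby

BLANK = "_"  # blank symbol (assumption: placed at index 0 of the alphabet)


def ctc_greedy_decode(frame_probs, alphabet):
    """frame_probs: one probability list per frame, aligned with alphabet."""
    # 1. Best symbol per frame.
    best_path = [alphabet[max(range(len(p)), key=p.__getitem__)]
                 for p in frame_probs]
    # 2. Collapse runs of the same symbol.
    collapsed = [symbol for symbol, _ in groupby(best_path)]
    # 3. Remove blanks.
    return "".join(s for s in collapsed if s != BLANK)


alphabet = [BLANK, "c", "a", "t"]
frames = [
    [0.10, 0.80, 0.05, 0.05],  # c
    [0.10, 0.70, 0.10, 0.10],  # c (repeat, collapsed)
    [0.60, 0.20, 0.10, 0.10],  # blank
    [0.10, 0.10, 0.70, 0.10],  # a
    [0.20, 0.10, 0.60, 0.10],  # a (repeat, collapsed)
    [0.10, 0.10, 0.10, 0.70],  # t
]
text = ctc_greedy_decode(frames, alphabet)
```

The blank is what allows genuine double letters: a path such as `l _ l` decodes to `ll`, while `l l` collapses to a single `l`.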
Trends
| Model | Accuracy | Offline | Multilingual | Ease of Use | Cost |
|---|---|---|---|---|---|
| Whisper | ✅✅✅ | ✅ | ✅✅✅ | ✅✅ | Free |
| Wav2Vec 2.0 | ✅✅ | ✅ | ⚠️ (mostly English) | ✅ | Free |
| Google API | ✅✅✅ | ❌ | ✅✅✅ | ✅✅✅ | Paid |
| DeepSpeech | ✅ | ✅ | ❌ | ✅✅ | Free |
| Kaldi | ✅✅ | ✅ | ✅ (with effort) | ⚠️ Complex | Free |
| Vosk | ✅✅ | ✅ | ✅✅ | ✅✅✅ | Free |
| Model | Cost | Offline Use | Cloud Required |
|---|---|---|---|
| Whisper | Free (Open Source) | ✅ | ❌ |
| Wav2Vec 2.0 | Free (Open Source) | ✅ | ❌ |
| DeepSpeech | Free (Open Source) | ✅ | ❌ |
| Kaldi | Free (Open Source) | ✅ | ❌ |
| Vosk | Free (Open Source) | ✅ | ❌ |
| Service | Approx. Cost | Offline Use | Cloud Required |
|---|---|---|---|
| Google Speech API | ~$1.44 per hour | ❌ | ✅ |
| AWS Transcribe | ~$1.44 per hour | ❌ | ✅ |
| Azure Speech | ~$1.60 per hour | ❌ | ✅ |
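To make the hourly rates concrete, here is a rough cost estimate using the approximate prices from the table above. The 30-hour workload is an assumption for illustration (it matches the course length mentioned in the sample lecture later in this deck); actual pricing varies by tier and region.

```python
# Approximate per-hour rates in USD, copied from the table above.
HOURLY_RATE_USD = {
    "Google Speech API": 1.44,
    "AWS Transcribe": 1.44,
    "Azure Speech": 1.60,
}


def cloud_cost(hours, service):
    """Estimated transcription cost in USD, rounded to cents."""
    return round(hours * HOURLY_RATE_USD[service], 2)


# Hypothetical workload: 30 hours of lecture audio.
estimate = {name: cloud_cost(30, name) for name in HOURLY_RATE_USD}
```

Under these assumptions the estimate comes to about $43 to $48 per 30 hours of audio, compared with no per-hour fee for the open-source models run locally.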
✔️ Recommendation:
Use Whisper or Wav2Vec 2.0 for local, cost-effective transcription
Use Vosk for light, multilingual offline STT (e.g. edge devices)
Use Cloud APIs only for real-time or highly multilingual needs
Chosen Model: openai/whisper-medium
whisper-large: Higher accuracy but heavier (more VRAM)
whisper-tiny / base: Faster, but less accurate
📌 Original:
> At the outset, this being the first session, it is very important to give an overview of the course. This course is spread over 12 weeks and we may have 30 hours of teaching involved in this. Let me also introduce you to the objectives of this course so that the intentions become clearer.
🌀 Whisper Output:
> At the outset, this being the first session, it is very important to give an overview of the course. This course is spread over 12 weeks and we may have 30 hours of teaching involved in this and we also introduce you to the objectives of this course so that the intentions become more clear.
🔹 ✅ Core content retained
🔹 ⚠️ Minor stylistic variation – rephrasing & joined sentences
📌 Original:
> Welcome back to the lectures on Engineering Mathematics-I.
> Today, we will learn Taylor’s Polynomial and Taylor Series.
🌀 Whisper Output:
> Hi, welcome back to the lectures on Engineering Mathematics I and today’s we will learn Taylor’s polynomial and Taylor series.
🔹 ✅ Core message preserved
🔹 Minor changes:
- Added “Hi”
- “today’s we” instead of “today we”
- Sentences joined with “and”; remaining differences are punctuation and capitalization only
📌 Original:
> The polynomial of degree 0 will simply be 1.
> If we plot this, it’s the green line through the point (0, 1).
🌀 Whisper Output:
> So, the polynomial of degree 0 will be simply 1 and if we plot this. So, this is the green plot here of the exponential function and this polynomial of degree 0 is just a constant line. So, the straight line going through this 0 1 point.
🔹 ✅ Richer phrasing from audio
🔹 Whisper kept the speaker's spontaneous repetitions and fillers (“so”, “here”)
🔹 All technical meaning retained
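The side-by-side comparisons above could also be quantified with word error rate (WER): the word-level edit distance between reference and hypothesis, divided by the number of reference words. The sketch below is one possible way to do this, not the method used for this deck; the normalization (lowercasing, dropping punctuation, splitting on hyphens) is an assumption, and different choices give different numbers.

```python
import re


def tokens(text):
    """Lowercase, unify apostrophes, and keep only word-like tokens."""
    return re.findall(r"[a-z0-9']+", text.lower().replace("’", "'"))


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = tokens(reference), tokens(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Second example from this deck.
original = ("Welcome back to the lectures on Engineering Mathematics-I. "
            "Today, we will learn Taylor’s Polynomial and Taylor Series.")
whisper = ("Hi, welcome back to the lectures on Engineering Mathematics I "
           "and today’s we will learn Taylor’s polynomial and Taylor series.")
wer = word_error_rate(original, whisper)  # 3 edits / 18 reference words
```

Here the three edits are the inserted “Hi”, the inserted “and”, and “today’s” substituted for “today”, which matches the minor changes listed for that example.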
✔️ Recommended:
Use Whisper for secure, offline transcription
Use cloud APIs only where real-time & scalability are critical
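For reference, a minimal sketch of local transcription with the chosen model, assuming the Hugging Face `transformers` library (plus a backend such as PyTorch and `ffmpeg` for audio decoding) is installed. The file name `lecture.wav` is hypothetical, and this is not necessarily the setup used to produce the outputs shown earlier. The model weights are downloaded on first use; after that, inference runs locally.

```python
MODEL_ID = "openai/whisper-medium"  # model named in this deck


def transcribe(audio_path, model_id=MODEL_ID):
    """Transcribe a local audio file and return the recognized text."""
    # Imported lazily so this module loads even where transformers is absent.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=30,  # process long recordings in 30-second chunks
    )
    return asr(audio_path)["text"]


# Example call (hypothetical file name):
# print(transcribe("lecture.wav"))
```

Swapping `model_id` for a smaller or larger Whisper checkpoint trades accuracy against speed and memory, as noted on the model-choice slide.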
Questions? Suggestions?
Presented by Omkar Ninav — June 2025
Internal Use Only | Not For Redistribution